
    Towards The Efficient Use Of Fine-Grained Provenance In Datascience Applications

    Recent years have witnessed increased demand for users to be able to interpret the results of data science pipelines, locate erroneous data items in the input, evaluate the importance of individual input data items, and acknowledge the contributions of data curators. Such applications often involve the use of provenance at a fine-grained level and require very fast response times. To address this issue, my goal is to expedite the use of fine-grained provenance in applications within both the database and machine learning (ML) domains, which are ubiquitous in contemporary data science pipelines. In the database domain, I focus on the problem of data citation and provide two types of solutions, rewriting-based and provenance-based, which generate fine-grained citations to database query results by implicitly or explicitly leveraging provenance information. In the ML domain, the first application that I consider is incrementally updating ML models after the deletion of a small subset of training samples. This is critical for understanding the importance of individual training samples to ML models, especially in online pipelines. For this problem, I provide two solutions, PrIU and DeltaGrad, which incrementally update ML models constructed by SGD/GD methods by utilizing provenance information collected during the training phase on the full dataset, before the deletion requests arrive. The second ML application that I focus on is cleaning uncertain labels in the ML training dataset in a more efficient and cheaper manner. To address this problem, I propose a solution, CHEF, that reduces the cost and overhead at each phase of the label cleaning pipeline while maintaining overall model performance. I also propose initial ideas for removing some of the assumptions used in these solutions in order to extend them to more general scenarios.
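The idea of reusing information cached during the original training run to handle deletions can be illustrated with a much simpler model than the ones the abstract targets. The sketch below is not PrIU or DeltaGrad, which address models trained by SGD/GD; it assumes ridge-regularized least squares, where the cached quantities are the Gram matrix and the moment vector. All function names and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_with_cache(X, y, ridge=1e-6):
    """Fit ridge-regularized least squares and cache its sufficient statistics."""
    gram = X.T @ X        # d x d matrix accumulated over all training samples
    moment = X.T @ y      # d-vector accumulated over all training samples
    weights = np.linalg.solve(gram + ridge * np.eye(X.shape[1]), moment)
    return weights, (gram, moment)


def delete_and_update(cache, X_removed, y_removed, ridge=1e-6):
    """Update the model after deleting samples, without touching the kept data."""
    gram, moment = cache
    # Subtract only the contributions of the deleted samples from the cache.
    gram_new = gram - X_removed.T @ X_removed
    moment_new = moment - X_removed.T @ y_removed
    weights = np.linalg.solve(gram_new + ridge * np.eye(gram.shape[0]), moment_new)
    return weights, (gram_new, moment_new)


# Synthetic data (an assumption made purely for the demonstration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=200)

_, cache = fit_with_cache(X, y)

# Delete three training samples and update the model from the cache.
removed = np.array([3, 17, 42])
w_incremental, _ = delete_and_update(cache, X[removed], y[removed])

# Reference: retrain from scratch on the remaining samples.
keep = np.setdiff1d(np.arange(X.shape[0]), removed)
w_retrained, _ = fit_with_cache(X[keep], y[keep])
```

In this closed-form setting the incremental result matches retraining up to floating-point error, and the update cost depends on the number of deleted samples rather than on the size of the full dataset. Iteratively trained models have no such exact shortcut, which is the harder setting the abstract describes.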

    MDB: Interactively Querying Datasets and Models

    As models are trained and deployed, developers need to be able to systematically debug errors that emerge in the machine learning pipeline. We present MDB, a debugging framework for interactively querying datasets and models. MDB integrates functional programming with relational algebra to build expressive queries over a database of datasets and model predictions. Queries are reusable and easily modified, enabling debuggers to rapidly iterate and refine queries to discover and characterize errors and model behaviors. We evaluate MDB on object detection, bias discovery, image classification, and data imputation tasks across self-driving videos, large language models, and medical records. Our experiments show that MDB enables up to 10x faster and 40% shorter queries than other baselines. In a user study, we find developers can successfully construct complex queries that describe errors of machine learning models.
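A rough sense of what combining functional programming with relational-style operators over model predictions can look like is given below. This is not MDB's API, which the abstract does not show; the `Prediction` record, the `select` and `compose` helpers, and the sample rows are all hypothetical, and only a selection operator is sketched.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Prediction:
    """One row in a hypothetical table of model predictions."""
    example_id: int
    label: str
    predicted: str
    confidence: float


# A query is a function from rows to rows, so queries compose like functions.
Query = Callable[[Iterable[Prediction]], List[Prediction]]


def select(predicate: Callable[[Prediction], bool]) -> Query:
    """Relational selection: keep the rows that satisfy the predicate."""
    return lambda rows: [row for row in rows if predicate(row)]


def compose(*queries: Query) -> Query:
    """Chain queries left to right into a single reusable query."""
    def run(rows: Iterable[Prediction]) -> List[Prediction]:
        result = list(rows)
        for query in queries:
            result = query(result)
        return result
    return run


# Small reusable building blocks that can be refined or recombined.
misclassified = select(lambda row: row.label != row.predicted)
confident = select(lambda row: row.confidence >= 0.9)
confident_errors = compose(misclassified, confident)

rows = [
    Prediction(1, "car", "car", 0.97),
    Prediction(2, "car", "truck", 0.95),
    Prediction(3, "bike", "car", 0.55),
]
result_ids = [row.example_id for row in confident_errors(rows)]
```

Because each query is an ordinary value, changing the confidence threshold or swapping one predicate produces a new query without rewriting the rest, which is the kind of rapid iteration the abstract emphasizes.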

    Data Citation: A New Provenance Challenge


    Dynamic Gaussian Mixture based Deep Generative Model For Robust Forecasting on Sparse Multivariate Time Series

    Forecasting on sparse multivariate time series (MTS) aims to model the predictors of future values of time series given their incomplete past, which is important for many emerging applications. However, most existing methods process each MTS individually and do not leverage the dynamic distributions underlying the MTS, leading to sub-optimal results when the sparsity is high. To address this challenge, we propose a novel generative model that tracks the transition of latent clusters, instead of isolated feature representations, to achieve robust modeling. It is characterized by a newly designed dynamic Gaussian mixture distribution, which captures the dynamics of clustering structures and is used for emitting time series. The generative model is parameterized by neural networks. A structured inference network is also designed to enable inductive analysis. A gating mechanism is further introduced to dynamically tune the Gaussian mixture distributions. Extensive experimental results on a variety of real-life datasets demonstrate the effectiveness of our method. Comment: This paper is accepted by AAAI 202

    Identification of Prognostic Genes and Pathways in Lung Adenocarcinoma Using a Bayesian Approach

    Lung cancer is the leading cause of cancer-associated mortality in the United States and the world. Adenocarcinoma, the most common subtype of lung cancer, is generally diagnosed at a late stage with poor prognosis. In the past, extensive effort has been devoted to elucidating lung cancer pathogenesis and pinpointing genes associated with survival outcomes. As the progression of lung cancer is a complex process that involves coordinated actions of functionally associated genes from cancer-related pathways, there is a growing interest in simultaneous identification of both prognostic pathways and important genes within those pathways. In this study, we analyse The Cancer Genome Atlas lung adenocarcinoma data using a Bayesian approach incorporating the pathway information as well as the interconnections among genes. The top 11 pathways have been found to play significant roles in lung adenocarcinoma prognosis, including pathways in mitogen-activated protein kinase signalling, cytokine-cytokine receptor interaction, and ubiquitin-mediated proteolysis. We have also located key gene signatures such as RELB, MAP4K1, and UBE2C. These results indicate that the Bayesian approach may facilitate discovery of important genes and pathways that are tightly associated with the survival of patients with lung adenocarcinoma.

    Influence of turbid flood water release on sediment deposition and phosphorus distribution in the bed sediment of the Three Gorges Reservoir, China

    Excessive phosphorus (P) loading was identified as an urgent problem during the post-Three Gorges Reservoir (TGR) period. Turbid water with high suspended sediment loads has been periodically released during the flood season to mitigate sediment deposition in the TGR, but limited attention has been paid to its effect on the distribution of P in bed sediment within the reservoir. In this study, field surveys, historical monitoring data related to sediment deposition, and the physicochemical properties and fractional P content of the mainstream surface sediment and representative column sediment were used to investigate the effect of turbid flood water release on P distribution in bed sediment. The results revealed that turbid flood water release could discharge approximately 20% of the suspended sediment inflow entering the TGR. Additionally, both the particle size of the inflow sediment and the suspended sediment flux tended to decline, while the deposited sediment volume in the TGR increased steadily at a rate of 0.117 billion tonnes per year between 2004 and 2016. The median particle size (MPS) was larger for surface sediment obtained in the flood season than for that obtained in the dry season, and the MPS tended to increase with sediment depth from 0 to 20 cm. The total phosphorus (TP) content in sediment was 2.6% to 17.5% lower in the flood water releasing period than in the non-flood water storing period. However, no consistent variation was detected in the vertical distribution of P fractions in the top 20 cm of bed sediment. Compared with lakes with slow deposition rates, the TGR showed a rapid sedimentation rate of >1.0 m/y, which mostly resulted in the uniform distribution of the surface sediment P fraction.